For this question, we will examine a series of tools to perform essential quality-control analyses for high-throughput RNA sequencing data.
You are asked by a collaborator to analyze four RNA-seq libraries. She suspects that the libraries are generally of high-quality but is concerned that a sample may have been switched with her benchmates during processing. Execute FastQC, STAR, and RSeQC (tin.py) to determine whether any of the samples exhibit unusual quailty control metrics. Overall, identify the best and worst libraries. Your answer should provide evidence from all three tools. Include screen shots and tables as necessary as if you were delivering a report to the collaborator.
Sequencing data:
/n/stat115/2020/HW2/raw_data
modules: fastqc/0.11.8-fasrc01, STAR/2.6.0c-fasrc01
index: /n/stat115/2020/HW2/star_hg38_index
bed: /n/stat115/2020/HW2/hg38_RefSeq.bed
Hint: not required but helpful to run STAR with these parameters:
--outSAMtype BAM SortedByCoordinate
--readFilesCommand zcat
Student response
Results from FastQC
Quality scores, Sequence Content and GC distribution plot of the four libraries are displayed above (with order: W, X, Y, Z). We can see that in the sequence content plot that library Z has imbalanced base composition and failed this session. Also in the GC distribution plot, library Z’s GC count distribution deviates from the theoretical curve.
Results from STAR:
=====================================RunW==================================
Started job on | Feb 23 03:31:22
Started mapping on | Feb 23 03:33:04
Finished on | Feb 23 03:34:50
Mapping speed, Million of reads per hour | 101.89
Number of input reads | 3000000
Average input read length | 50
UNIQUE READS:
Uniquely mapped reads number | 2022651
Uniquely mapped reads % | 67.42%
Average mapped length | 49.82
Number of splices: Total | 236342
Number of splices: Annotated (sjdb) | 229910
Number of splices: GT/AG | 234284
Number of splices: GC/AG | 1510
Number of splices: AT/AC | 140
Number of splices: Non-canonical | 408
Mismatch rate per base, % | 0.29%
Deletion rate per base | 0.01%
Deletion average length | 1.29
Insertion rate per base | 0.01%
Insertion average length | 1.17
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 727449
% of reads mapped to multiple loci | 24.25%
Number of reads mapped to too many loci | 13729
% of reads mapped to too many loci | 0.46%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 7.65%
% of reads unmapped: other | 0.23%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
=====================================RunX==================================
Started job on | Feb 23 03:35:27
Started mapping on | Feb 23 03:35:44
Finished on | Feb 23 03:37:20
Mapping speed, Million of reads per hour | 112.50
Number of input reads | 3000000
Average input read length | 50
UNIQUE READS:
Uniquely mapped reads number | 2311502
Uniquely mapped reads % | 77.05%
Average mapped length | 49.83
Number of splices: Total | 324261
Number of splices: Annotated (sjdb) | 316057
Number of splices: GT/AG | 321601
Number of splices: GC/AG | 1963
Number of splices: AT/AC | 184
Number of splices: Non-canonical | 513
Mismatch rate per base, % | 0.25%
Deletion rate per base | 0.00%
Deletion average length | 1.44
Insertion rate per base | 0.01%
Insertion average length | 1.20
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 617312
% of reads mapped to multiple loci | 20.58%
Number of reads mapped to too many loci | 14722
% of reads mapped to too many loci | 0.49%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 1.65%
% of reads unmapped: other | 0.23%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
=====================================RunY==================================
Started job on | Feb 23 03:37:56
Started mapping on | Feb 23 03:38:13
Finished on | Feb 23 03:39:42
Mapping speed, Million of reads per hour | 121.35
Number of input reads | 3000000
Average input read length | 50
UNIQUE READS:
Uniquely mapped reads number | 2514248
Uniquely mapped reads % | 83.81%
Average mapped length | 49.82
Number of splices: Total | 335593
Number of splices: Annotated (sjdb) | 327213
Number of splices: GT/AG | 332661
Number of splices: GC/AG | 2145
Number of splices: AT/AC | 260
Number of splices: Non-canonical | 527
Mismatch rate per base, % | 0.24%
Deletion rate per base | 0.00%
Deletion average length | 1.45
Insertion rate per base | 0.01%
Insertion average length | 1.20
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 433378
% of reads mapped to multiple loci | 14.45%
Number of reads mapped to too many loci | 12249
% of reads mapped to too many loci | 0.41%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 1.06%
% of reads unmapped: other | 0.27%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
=====================================RunZ==================================
Started job on | Feb 23 03:40:15
Started mapping on | Feb 23 03:40:32
Finished on | Feb 23 03:44:04
Mapping speed, Million of reads per hour | 50.94
Number of input reads | 3000000
Average input read length | 50
UNIQUE READS:
Uniquely mapped reads number | 442168
Uniquely mapped reads % | 14.74%
Average mapped length | 48.10
Number of splices: Total | 168784
Number of splices: Annotated (sjdb) | 168479
Number of splices: GT/AG | 168728
Number of splices: GC/AG | 20
Number of splices: AT/AC | 3
Number of splices: Non-canonical | 33
Mismatch rate per base, % | 9.05%
Deletion rate per base | 0.00%
Deletion average length | 2.20
Insertion rate per base | 0.00%
Insertion average length | 1.53
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 572719
% of reads mapped to multiple loci | 19.09%
Number of reads mapped to too many loci | 825
% of reads mapped to too many loci | 0.03%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 66.14%
% of reads unmapped: other | 0.00%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
The uniquely mapped reads in library W, X, and Y are reasonable (67.42%, 77.05% and 83.81%), whereas the uniquely mapped reads in library Z is quite low (14.74%) and the mismatch rate per base is much higher (9.05%) than other libraries. Also there is a high percentage of reads unmapped (66.14%) in library Z. Library Y performs pretty well according to the STAR result, with the highest uniquely mapped reads percentage and lowest mismatch rate.
RSeQC results:
=====================================RunW==================================
Bam_file TIN(mean) TIN(median) TIN(stdev)
runW_sorted.bam 34.94160281773132 31.60982628740766 22.94113216430839
=====================================RunX==================================
Bam_file TIN(mean) TIN(median) TIN(stdev)
runX_sorted.bam 40.85057373281427 39.527069505651056 24.110339363400527
=====================================RunY==================================
Bam_file TIN(mean) TIN(median) TIN(stdev)
runY_sorted.bam 43.267014161444344 43.30489990415091 24.294023616775483
=====================================RunZ==================================
Bam_file TIN(mean) TIN(median) TIN(stdev)
runZ_sorted.bam 16.006341411923454 14.884227029485709 14.7148665428431
According to RSeQC, Library Z has the lowest Tin(mean) (16.006) and Library Y has the highest Tin(mean) (43.267). Based on the results from FastQC, RSeQC and STAR, we can conclude that library Y is the best library and library Z is the worst library.
Your collaborator recalls that one of her samples was left on the bench for a couple of days before the full RNA-seq library was processed. Using the metrics from the question above, can you identify this sample? Provide your rationale.
Student response
I suggest that this sample is sample Z because it has the worst quality out of the four. It is mainly because that leaving RNA samples for a couple of days will result in the degredation of RNA samples. The degredation of this sample is demonstrated by the abnormal GC distribution and low tin score.
Process the 4 sequencing libraries with Salmon introduced in the previous question. Identify the transcript and gene with the highest expression in each library from the Salmon output.
module: salmon/0.12.0-fasrc01
index: /n/stat115/2020/HW2/salmon_hg38_index
Student response
#!/bin/bash
#SBATCH -n 1 # Number of cores requested
#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 20 # Runtime in minutes
#SBATCH -p serial_requeue # Partition to submit to
#SBATCH --mem=8G # Memory in GB (see also --mem-per-cpu)
module load salmon/0.12.0-fasrc01
index=/n/stat115/2020/HW2/salmon_hg38_index
path=/n/stat115/2020/HW2/raw_data/
files=("runW" "runX" "runY" "runZ")
for file in ${files[@]}; do
salmon quant -l A -i $index -r ${path}$file.fastq.gz --validateMappings -o transcript_${file}
cd /n/home06/stat115u20076/transcript_${file}
sort -n -k4 quant.sf > sorted_quant.sf
cd ..
done
Transcript and gene with highest expression in the four different libraries: runW: ENST00000361851 207 10.766 64997.019628 4108.748 runX: ENST00000361851 207 10.766 41996.698910 2796.513 runY: ENST00000361851 207 10.766 26495.853255 1370.680 runZ: ENST00000444587 307 58.716 323475.701112 12106.790
Report the relative speed of Salmon and STAR for the analyses of these four samples. Comment on your results based on the lecture material.
Hint: you can parse the times from log files or use the `time` tool in the command line...
e.g. time sleep 2
Student response
[stat115u20076@boslogin01 hw2]$ head -3 runW_star_outputLog.final.out Started job on | Feb 23 03:31:22 Started mapping on | Feb 23 03:33:04 Finished on | Feb 23 03:34:50 [stat115u20076@boslogin01 hw2]$ head -3 runX_star_outputLog.final.out Started job on | Feb 23 03:35:27 Started mapping on | Feb 23 03:35:44 Finished on | Feb 23 03:37:20 [stat115u20076@boslogin01 hw2]$ head -3 runY_star_outputLog.final.out Started job on | Feb 23 03:37:56 Started mapping on | Feb 23 03:38:13 Finished on | Feb 23 03:39:42 [stat115u20076@boslogin01 hw2]$ head -3 runZ_star_outputLog.final.out Started job on | Feb 23 03:40:15 Started mapping on | Feb 23 03:40:32 Finished on | Feb 23 03:44:04
Total time cost: Approximately 11 minutes in total
Speed of Salmon for the analyses of the for samples: RunW Start on: [2020-02-24 21:01:00.765] RunW End on: [2020-02-24 21:01:17.969]’ RunX Start on: [2020-02-24 21:01:20.313] RunX End on: [2020-02-24 21:01:39.428] RunY Start on: [2020-02-24 21:01:42.842] RunY End on: [2020-02-24 21:02:03.786] RunZ Start on: [2020-02-24 21:02:07.507] RunZ End on: [2020-02-24 21:02:17.130]
Total time cost: Approximately 1 minute.
Salmon pseudoalignment takes much less time than star alignment. This is mainly because that pseudoalignment does not align base-to-base. Instead, it directly compare the raw sequence reads with transcript sequences and then used to quanitfy transcript abundance, which greatly improve the speed.
Plot the relationship between effective length, normalized read counts, TPM, and FPKM for runX from the Salmon output. Comment on the relative utility of each metric when analyzing gene expression data.
Student response
Effective length: Effective length accounts for the number of possible start positions for a read or fragment within that transcript and is used for the calculation of TPM and FPKM.
Normalized read counts is used for the calculation of FPKM, while it doesn’t taken into account of gene length.
TPM and FPKM are similar except for the order of operation: in TPM we normalize for gene length first, and then normalize for sequencing depth second. In TPM, the sum of all TPMs in each sample are the same, making it easier to compare the proportion of reads that mapped to a gene in each sample. In contrast, with FPKM, the sum of the normalized reads in each sample may be different, and this makes it harder to compare samples directly.
TPM is thus more commonly used in analyzing gene expression data.
In 2014, a controversial manuscript from Lin et al. argued that, based on RNA-seq of several tissues from both mouse and human, fundamental physiological differences existed between these two organisms. Here, we will investigate these claims for a subset of the data analyzed. (Note: a copy of this manuscript is included as the part3_4-manuscript.pdf file associated with this homework.) The provided data is a counts matrix of the samples with the following conventions:
row: <mouse>_<human> gene name (e.g. Stag2_STAG2)
column: <organism>_<tissue>_<batch> sample identifier (e.g. human_adipose_3)
Perform a principle component analysis of the samples using the top 5,000 most variable genes as features. Indicate the species, tissue, and batch per sample plot. Do the results support the conclusions of the original paper? Do the results suggest the presence of a batch effect?
library(stringr)
library(pheatmap)
library(ggplot2)
# Import processed raw counts
counts <- readRDS("../data/part3_counts.rds")
# Perform a log TPM normalization
log2tpm <- sapply(1:dim(counts)[2], function(idx){
log2((counts[,idx]/sum(counts[,idx]) * 1000000) + 1)
})
colnames(log2tpm) <- colnames(counts)
# Continue analysis here.
sample_data =str_split_fixed(colnames(log2tpm), "_", 3)
data.frame(sample_data)
## X1 X2 X3
## 1 human adipose 3
## 2 human brain 4
## 3 human heart 2
## 4 human kidney 2
## 5 human liver 2
## 6 human lung 3
## 7 human pancreas 4
## 8 human spleen 2
## 9 mouse adipose 1
## 10 mouse brain 4
## 11 mouse heart 5
## 12 mouse kidney 5
## 13 mouse liver 5
## 14 mouse lung 1
## 15 mouse pancreas 1
## 16 mouse spleen 4
sample = data.frame(str_split_fixed(colnames(log2tpm), "_", 3))
colnames(sample) = c("Species","Tissue","Batch")
rowVar <- function(x, ...){
rowSums((x-rowMeans(x,...))^2,...)/(dim(x)[2]-1)
}
threshold = sort(rowVar(log2tpm), decreasing = TRUE)[5000]
log2tpm_variable = log2tpm[rowVar(log2tpm) >= threshold,]
#head(log2tpm_variable)
# PCA on top 5,000 most variable genes
pheatmap(cor(log2tpm_variable))
pca_df = data.frame(sample, prcomp(log2tpm_variable, scale = TRUE, center = TRUE)$rotation)
ggplot(pca_df, aes(x=PC2, y=PC3, color = factor(Tissue), shape= factor(Batch), size = factor(Species))) + geom_point()
## Warning: Using size for a discrete variable is not advised.
Student response
The result supports the conclusion of the paper since in the PCA plot, genes in different tissue from the same species sit closer to each other than genes in the same tissue from different species. However, this plot suggests the possible batch effect because genes with same batch number(shape) tends to locate closer to each other, forming clusters. Also, batch number is correlated with species. We cannot determine whether the previously stated conclusion is due to batch effect or not.
Run COMBAT on the samples to remove the batch effect. Visualize the results using a similar principle component analysis as the question above. Provide evidence that the batch effects are successfully adjusted. Do these results change the primary interpretation of the results?
library(sva)
## Loading required package: mgcv
## Loading required package: nlme
## This is mgcv 1.8-31. For overview type 'help("mgcv-package")'.
## Loading required package: genefilter
## Loading required package: BiocParallel
## Warning: package 'BiocParallel' was built under R version 3.6.2
pca_df$Batch
## [1] 3 4 2 2 2 3 4 2 1 4 5 5 5 1 1 4
## Levels: 1 2 3 4 5
combat_dat <- ComBat(log2tpm_variable, pca_df$Batch, par.prior = TRUE)
## Found5batches
## Adjusting for0covariate(s) or covariate level(s)
## Standardizing Data across genes
## Fitting L/S model and finding priors
## Finding parametric adjustments
## Adjusting the Data
pheatmap(cor(combat_dat))
pca_combat_df <- data.frame(sample,
prcomp(combat_dat, scale = TRUE, center = TRUE)$rotation
)
head(pca_combat_df)
## Species Tissue Batch PC1 PC2 PC3
## human_adipose_3 human adipose 3 -0.2640517 0.24056857 -0.28431113
## human_brain_4 human brain 4 -0.2527535 -0.28527745 0.09753319
## human_heart_2 human heart 2 -0.2425055 -0.03695054 -0.46473639
## human_kidney_2 human kidney 2 -0.2599336 -0.16257776 0.27225216
## human_liver_2 human liver 2 -0.2186763 0.41059809 0.37025028
## human_lung_3 human lung 3 -0.2752891 -0.17256720 0.18897479
## PC4 PC5 PC6 PC7 PC8
## human_adipose_3 0.06360443 -0.04190041 0.1462182 -0.11503206 0.26953370
## human_brain_4 0.21072217 -0.04388635 -0.2203862 0.35694104 0.40249222
## human_heart_2 0.22408540 0.11841575 -0.2723406 -0.29286254 -0.05927706
## human_kidney_2 0.20159650 -0.36953849 0.1412844 -0.14021482 -0.12880328
## human_liver_2 0.12234034 0.27788012 -0.2203694 -0.05301013 -0.01594104
## human_lung_3 -0.20636755 -0.18294554 -0.2984436 0.06345698 -0.08319480
## PC9 PC10 PC11 PC12 PC13
## human_adipose_3 -0.50993607 0.3087867 -0.21584358 0.01067201 0.07935282
## human_brain_4 -0.37352568 -0.3052667 0.29478048 0.03801809 -0.24404311
## human_heart_2 -0.02951845 -0.3371521 -0.07180010 0.43820860 0.25082101
## human_kidney_2 -0.22042019 -0.1881649 -0.25968894 -0.51088768 0.30094491
## human_liver_2 -0.08128643 0.2983601 0.50938177 -0.03847482 0.28102295
## human_lung_3 0.33764677 -0.2190148 0.08206658 0.07313516 0.30781147
## PC14 PC15 PC16
## human_adipose_3 -0.12912913 0.40690568 -0.29644418
## human_brain_4 -0.18909341 -0.19797838 -0.07317112
## human_heart_2 0.17700204 -0.03488933 0.29828607
## human_kidney_2 0.10214870 -0.02971510 0.29008504
## human_liver_2 0.02130746 0.03914487 0.25525810
## human_lung_3 -0.12129213 0.50515676 -0.37941022
ggplot(pca_combat_df, aes(x=PC2, y=PC3, color = factor(Tissue), shape= factor(Batch), size = factor(Species))) + geom_point()
## Warning: Using size for a discrete variable is not advised.
Student response
Batch effect is removed because in the plot after correction, genes with different batch number are now distributed evenly in space rather than gather in clusters. Further evidence supports that batch effect is removed. In the heatmap, the correlation between gene expression in the same batch from different tissues is lower than before batch effect correction. The primary interpretation is however altered. We can observe that now data points with the same color sit closer to each other now regardless of marker size(species). Therefore, we might suspect that after batch effect correction, the true situation is gene expression in the same tissue in different species are more similar.
Run DESeq2 adjusting for the batch effect to identify differentially-expressed genes between the lung and adipose tissue. Report the number of statistically-significant genes as well as whether they are more highly expressed in either adipose tissue or lung tissue.
library(DESeq2)
## Loading required package: S4Vectors
## Warning: package 'S4Vectors' was built under R version 3.6.2
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which, which.max, which.min
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
##
## expand.grid
## Loading required package: IRanges
## Warning: package 'IRanges' was built under R version 3.6.2
##
## Attaching package: 'IRanges'
## The following object is masked from 'package:nlme':
##
## collapse
## Loading required package: GenomicRanges
## Loading required package: GenomeInfoDb
## Loading required package: SummarizedExperiment
## Warning: package 'SummarizedExperiment' was built under R version 3.6.2
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: DelayedArray
## Warning: package 'DelayedArray' was built under R version 3.6.2
## Loading required package: matrixStats
##
## Attaching package: 'matrixStats'
## The following objects are masked from 'package:Biobase':
##
## anyMissing, rowMedians
## The following objects are masked from 'package:genefilter':
##
## rowSds, rowVars
##
## Attaching package: 'DelayedArray'
## The following objects are masked from 'package:matrixStats':
##
## colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
## The following objects are masked from 'package:base':
##
## aperm, apply, rowsum
library(apeglm)
ddsMat = DESeqDataSetFromMatrix(countData = counts,
colData = sample,
design = ~ Species + Tissue + Batch)
ddsMat = DESeq(ddsMat)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## 1 rows did not converge in beta, labelled in mcols(object)$betaConv. Use larger maxit argument with nbinomWaldTest
rdf <- results(ddsMat,contrast = c("Tissue", "lung","adipose"))
head(rdf)
## log2 fold change (MLE): Tissue lung vs adipose
## Wald test p-value: Tissue lung vs adipose
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE
## <numeric> <numeric> <numeric>
## Stag2_STAG2 2840.75800123366 1.59300733310916 0.418028298118098
## Stag1_STAG1 1135.48858879233 0.0492194531241867 0.417616976804961
## Gosr2_GOSR2 461.596589536106 -0.822892994304817 0.53828737157743
## 4933434E20Rik_C1orf43 1561.62011584001 1.21509862208803 0.431078587193546
## Art5_ART5 1.78123841544922 1.46937951851083 5.97922596432191
## Art1_ART1 25.6926544397484 3.71573222987284 5.98953690842474
## stat pvalue
## <numeric> <numeric>
## Stag2_STAG2 3.81076434365962 0.000138537783740325
## Stag1_STAG1 0.117857883797606 0.906180264455161
## Gosr2_GOSR2 -1.52872431670348 0.126332801443026
## 4933434E20Rik_C1orf43 2.81874038327604 0.00482124956649308
## Art5_ART5 0.245747447458689 0.805877740445531
## Art1_ART1 0.620370537269147 0.535013865111339
## padj
## <numeric>
## Stag2_STAG2 0.00395482382247241
## Stag1_STAG1 0.962053516565546
## Gosr2_GOSR2 0.323642767159089
## 4933434E20Rik_C1orf43 0.0418156317904143
## Art5_ART5 NA
## Art1_ART1 0.743589385128234
# Count statistically significant genes with adjusted p-value < 0.01
sum(rdf$padj < 0.01, na.rm = TRUE)
## [1] 667
# Determine whether these genes are more highly expressed in adpose tissue or lung tissue.
sum(rdf$padj < 0.01 & rdf$log2FoldChange > 0, na.rm=TRUE)
## [1] 268
sum(rdf$padj < 0.01 & rdf$log2FoldChange <= 0, na.rm=TRUE)
## [1] 399
dds_shrink <- lfcShrink(ddsMat, coef="Tissue_lung_vs_adipose", type="apeglm")
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
## Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
## sequence count data: removing the noise and preserving large differences.
## Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:matrixStats':
##
## count
## The following object is masked from 'package:Biobase':
##
## combine
## The following objects are masked from 'package:GenomicRanges':
##
## intersect, setdiff, union
## The following object is masked from 'package:GenomeInfoDb':
##
## intersect
## The following objects are masked from 'package:IRanges':
##
## collapse, desc, intersect, setdiff, slice, union
## The following objects are masked from 'package:S4Vectors':
##
## first, intersect, rename, setdiff, setequal, union
## The following objects are masked from 'package:BiocGenerics':
##
## combine, intersect, setdiff, union
## The following object is masked from 'package:nlme':
##
## collapse
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
rds_df <- data.frame(dds_shrink, gene = rownames(dds_shrink))
head(rds_df %>% arrange(desc(log2FoldChange)))
## baseMean log2FoldChange lfcSE pvalue padj gene
## 1 2528.8788 15.065076 2.597096 3.920427e-11 2.517894e-08 Scgb3a2_SCGB3A2
## 2 1517.5122 14.922357 1.772117 6.095269e-18 2.609791e-14 Cldn18_CLDN18
## 3 335.3639 11.732491 2.546682 1.400301e-11 1.199125e-08 Foxa2_FOXA2
## 4 1801.8251 10.193046 1.097325 3.135725e-22 4.027839e-18 Lamp3_LAMP3
## 5 310.5671 9.802070 1.157523 5.854876e-18 2.609791e-14 Slc6a14_SLC6A14
## 6 3700.2177 8.092307 5.008655 1.507068e-05 7.805760e-04 Cxcl15_IL8
log2tpm_variable[c('Scgb3a2_SCGB3A2','Cldn2_CLDN2'),]
## human_adipose_3 human_brain_4 human_heart_2 human_kidney_2
## Scgb3a2_SCGB3A2 0.1937716 0 0.04383398 0.000000
## Cldn2_CLDN2 2.5219917 0 0.00000000 5.815961
## human_liver_2 human_lung_3 human_pancreas_4 human_spleen_2
## Scgb3a2_SCGB3A2 0.000000 11.43139 0.2184518 2.956983
## Cldn2_CLDN2 5.420732 0.00000 2.6395794 0.000000
## mouse_adipose_1 mouse_brain_4 mouse_heart_5 mouse_kidney_5
## Scgb3a2_SCGB3A2 0.000000 0.000000 0 0.000000
## Cldn2_CLDN2 9.379971 1.894846 0 8.960572
## mouse_liver_5 mouse_lung_1 mouse_pancreas_1 mouse_spleen_4
## Scgb3a2_SCGB3A2 0.000000 10.754076 0.000000 0.00000000
## Cldn2_CLDN2 6.954886 1.337649 2.140597 0.06683525
# log2fc < 0 => upregulated in adipose
# log2fc > 0 => upregulated in lung
# Among the statistically significant genes, 268 w/ log2fc >0 and 399 w/ log2fc <0)
stat_sig_genes = (rds_df %>% filter(padj < 0.01))$gene
Student response
There are 667 significant differentially expressed genes, 268 of which upregulated in lung tissue and 399 upregulated in adipose. The significant de genes are more highly expressed in adipose.
Identify the top 5 most differentially expressed genes that are overexpressed in each of the tissues. Comment on the biological relevance of these. It may be useful to use data from the GTEx consortium when interpreting your result.
rds_df %>% arrange(padj) %>% head(5)
## baseMean log2FoldChange lfcSE pvalue padj gene
## 1 1801.8251 10.193046 1.097325 3.135725e-22 4.027839e-18 Lamp3_LAMP3
## 2 1517.5122 14.922357 1.772117 6.095269e-18 2.609791e-14 Cldn18_CLDN18
## 3 310.5671 9.802070 1.157523 5.854876e-18 2.609791e-14 Slc6a14_SLC6A14
## 4 1229.0663 -8.523320 1.112790 4.851228e-17 1.557851e-13 Cldn2_CLDN2
## 5 512.1478 7.759123 1.146407 5.690623e-15 1.461921e-11 Rtkn2_RTKN2
GTEx link: https://www.gtexportal.org/home
Student response
Top differentially expressed genes are Lamp3, Cldn18, Slc6a14, Cldn2 and Rtkn2. Up-regulated genes in lung tissue is associated with positive log2 fold change, and vice versa. LAMP3 is the abrreviation for lysosomal associated membrane protein 3. During human fetal development, between weeks 10 and 20, LAMP3 is highly expressed in the lungs, while in normal adult tissue cells LAMP3 is expressed in the lungs, appendix, testis and lymph nodes. According to GTEx consortium report, LAMP3 has a medium expression level around TPM 60.09. LAMP3 has a positive log2 fold change, which indicates it’s up-regulated in lung tissue, consistent with the result from GTEx. CLDN18 gene encodes a member of the claudin family- integral membrane proteins and components of tight junction strands. CLDN18 is biasedly expressed in stomach (TPM 391) and lung tissues (TPM 140.5), consistent with the positive log2 fold change. SLC6A14 encodes a member of the solute carrier family 6. Members of this family are sodium and chloride dependent neurotransmitter transporters. SLC6A14 is biasedly expressed in lung (TPM 8.798), minor salivary gland (TPM 29.57), consistent with the postive log2 fold change. CLDN2 has enhanced expression in brain, gallbladder, kidney, liver, seminal vesicle based on combined data from HPA, GTEX and FANTOM5, and has higher expression level in adipose than in lung according to GTEX, consistent with the negative log fold change. RTKN has biased expression in lung (RPKM 32.1) and testis (RPKM 2.2) according to HPA RNA expression database, consistent with a positive log2 fold change score.
Visualize the differential gene expression values by making a volcano and an MA plot to summarize the differences between the two tissues. Be sure to use the lfcShrink function to get more robust estimates of the fold-changes for genes.
Student response
library(EnhancedVolcano)
## Loading required package: ggrepel
plotMA(dds_shrink, ylim = c(-5, 5))
EnhancedVolcano(dds_shrink,
lab = rownames(dds_shrink),
x = 'log2FoldChange',
y = 'padj',
xlim =c(-10,10),
ylim = c(0,15))
Rerun differential gene expression analyses without accounting for the batch effect. Compare the number of differentially expressed genes and anecdotes of top differentially expressed genes. Are the numbers of differentially expressed genes before/after the batch effect consistent with what you would have expected? Comment on the biological relevance of the top genes.
ddsMat_wBatch = DESeqDataSetFromMatrix(countData = counts,
colData = sample,
design = ~ Species + Tissue)
ddsMat_wBatch = DESeq(ddsMat_wBatch)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
rdf_wbatch <- results(ddsMat_wBatch, contrast = c("Tissue", "lung", "adipose"))
sum(rdf_wbatch$padj < 0.01, na.rm = TRUE)
## [1] 244
sum(rdf_wbatch$padj < 0.01 & rdf_wbatch$log2FoldChange > 0, na.rm=TRUE)
## [1] 73
sum(rdf_wbatch$padj < 0.01 & rdf_wbatch$log2FoldChange <= 0, na.rm=TRUE)
## [1] 171
dds_wbatch_shrink <- lfcShrink(ddsMat_wBatch, coef="Tissue_lung_vs_adipose", type="apeglm")
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
## Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
## sequence count data: removing the noise and preserving large differences.
## Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
rds_wbatch_df = data.frame(dds_wbatch_shrink,gene = rownames(dds_wbatch_shrink))
rds_wbatch_df %>% arrange(padj) %>% head(5)
## baseMean log2FoldChange lfcSE pvalue padj
## 1 3700.2177 14.966501 2.1490230 2.127792e-14 2.793152e-10
## 2 512.1478 8.028300 1.2095954 4.286136e-13 2.813206e-09
## 3 1801.8251 9.902613 1.5816502 8.552490e-12 3.742284e-08
## 4 174.8338 5.400333 0.8812385 2.111439e-11 6.929216e-08
## 5 522.0949 -8.834057 1.8233293 4.695969e-10 1.232880e-06
## gene
## 1 Cxcl15_IL8
## 2 Rtkn2_RTKN2
## 3 Lamp3_LAMP3
## 4 AA986860_C1orf116
## 5 Lep_LEP
Student response 73 genes with positive fold change => high in lung 171 genes with negative fold change => high in adipose The statistically significant genes are more highly expressed in adipose. Top differentially expressed genes are IL8, RTKN2, LAMP3, Clorf116, and LEP. There are some overlap between the significant genes without batch effect correction and those obtained with batch effect correction. Genes found significant in this run that were not significant previously include IL8, C1orf116 and LEP. Il8 is a chemokine produced by macrophages and other cell types such as epithelial cells, airway smooth muscle cells and endothelial cells. In mouse, this gene is called Cxcl15_IL8 and has restricted expression toward lung adult (RPKM 125.0). C1orf116 has biased expression in lung (RPKM 46.8), esophagus (RPKM 46.0) and 10 other tissues and expression in adipose was not detected. LEP is a common gene encoding a protein that is secreted by white adipocytes into the circulation and plays a major role in the regulation of energy homeostasis. LEP is found enriched in adipose tissue, breast and placenta. The top significant differentially expressed genes are biologically relevant to the tissue where these genes are upregulated and the relative expression in lung and adipose tissue in this study is consistent with the results from online gene database such as NCBI gene db and GTEX consortium. There are less statistically significant differentially expressed genes without batch effect correction. It is consistent with our expectation because we are expecting genes from the same tissue in mouse and human to be similar and genes from different tissues to be differentially expressed regardless of species. Since in the original manuscript, the potential batch effect confounds the relationship between expression of gene, types of tissue and species, they reached the conclusion that the expression for many sets of genes was found to be more similar in different tissues within the same species than between species. We are expecting that after batch effect removal, there should be more differentially expressed gene in different tissues within the same species. —–
While the previous question identified genes that were differentially expressed between tissues and specific anecdotes were used for interpretation, we often want to assesss the differences between samples using a more wholistic approach. Pathway enrichment analyses provide a statistically principled way of examining many differentially expressed genes in an effort to identify biological patterns that explain the results. These patterns are defined using prior biological knowledge.
Run the up and down regulated genes computed in problem III.3 separately on DAVID (http://david.abcc.ncifcrf.gov/) to see whether these genes are enriched in specific biological process, pathways, etc. For example, consider reporting the enrichments for the top 100 genes in the KEGG pathways. If you were to summarize the results in a paper, how would you describe the systematic biologial features that are different between these tissues? Your analysis should comment on the stability of enriched pathways (with at least 2 different input gene list sizes) and attempt to interpret the results in the differential physiological properties of the tissues.
Student response
gene_list = str_split_fixed(stat_sig_genes,'_',2)[,2]
sub_gene_list = gene_list[1:200]
write.table(gene_list,"stat_sig_genes.txt",sep=";",row.names = FALSE, quote = FALSE)
write.table(sub_gene_list,"stat_sig_genes_sub.txt",sep=";",row.names = FALSE, quote = FALSE)
part4_path = c("/Users/biankaursul/Workspace/Spring 2020/BST282/Homework_2020/HW2/data/KEGG_PATHWAY1.png","/Users/biankaursul/Workspace/Spring 2020/BST282/Homework_2020/HW2/data/KEGG_PATHWAY2.png","/Users/biankaursul/Workspace/Spring 2020/BST282/Homework_2020/HW2/data/GOBP1.png","/Users/biankaursul/Workspace/Spring 2020/BST282/Homework_2020/HW2/data/GOBP2.png")
knitr::include_graphics(part4_path)
For stability analysis, DAVID annotations for both the full significant gene list and a subset of the genes are included. As observed from the KEGG pathway annotations of the statistically significant genes computed in PartIII by DAVID, there are two major categories of pathways enriched with these genes: pathways related to metabolism and pathways related to cell-matrix adhesion. The pathways related to lipid and glucose metabolism include AMPK signaling pathway, PPAR signaling pathway, Fructose and mannose metabolism, Insulin signaling pathway and etc. The remaining pathways are mainly responsible for providing physical and mechanical support for specific tissues for the proper functions of the organs, including Focal adhesion pathway and ECM-receptor interaction. We further investigate the annotated the biological process under gene ontology and found similar results. The up/down regulated genes in part III are enriched in multiple biological process including cell adhesion, extracellular matrix organization, glucose metabolic process, cellular lipid metabolic process and fatty acid homeostasis. This result is consistent with what we are expecting because these specific genes enriched pathways and biological processes are closely related to the function of the tissues we’re investigating. For example, adipose tissue has responsibility for lipid metabolism and glucose homeostasis. It is expected that genes regulating lipid and/or glucose homeostasis would have a high expression level in adipose tissue. Also, we have noticed that there are some genes in our gene list enriched in pathways related to cel-matrix adhesion, which is also expected since the architecture of lung tissue is determined by the intricate pulmonary extracellular matrix (ECM). The formation of and dynamic changes within ECM requires certain genes regulating cell-matrix adhesion and the formation of connective tissue such as elastin and collagen to be expressed at a high level in lung tissues to support a special structure for air transmission. We also noticed that the gene list is related to more metabolic pathways than other pathways, which is consistent with the conclusion in Part III that the significant genes are more highly differentially expressed in adipose tissues. The results from the full significant gene list is similar to the results from the subset of genes except that less pathways related to genes upregulated in lungs were reported, suggesting that the result is reasonably stable. Therefore, we can conclude that differentially expressed genes in specific pathways support the differential physiological properties of the tissues.
Describe in at least 3 but no more than 7 sentences the methodological differences between how approaches like DAVID and approaches like GSEA work in identifying enriched pathways from RNA-seq data.
Student response
Approaches like DAVID determines the overlaps between user-supplied gene lists and the curated databases and searching for overlaps with very small probability if the overlaps occur purely due to chance. The provided genes list should be pre-selected to be statistically significant. In DAVID methodologies, a significance values such as FDR or FWER need to be specified and changing the significance level will alter the enrichment result. DAVID like methods use statistical tests including hypergeometric, Fisher’s exact and Chi-squared tests. In contrast, approaches like GSEA can use the whole gene list instead of a set of statistically significant genes, and uses every point in the list. For example, in GSEA we rank all the genes by differential expression, plot the cumulative fraction function of the genome-wide gene list and the user-supplied gene lists and test the separation of the two curves by Kolmogorov-Smirnov test. Therefore, in GSEA, we need to define the set before looking at the data and evaluate significance as a whole, contrary to examine a pre-selected significant gene list in DAVID methodology.
Run Gene Set Enrichment analysis (http://www.broadinstitute.org/gsea/index.jsp) using the summary statistics from problem III.3. What are the gene sets or experiments that best capture the differential expression data between these two cell types? Comment on the biological relevance of the results and compare them to the results produced from the DAVID analysis.
Hint: the fgsea package (https://bioconductor.org/packages/release/bioc/html/fgsea.html)
is, in my hands, easier to use than the original java distribution of gsea
Student response
library(fgsea)
## Loading required package: Rcpp
devtools::install_github("mw201608/msigdb")
## Skipping install of 'msigdb' from a github remote, the SHA1 (cc678598) has not changed since last install.
## Use `force = TRUE` to force installation
library("msigdb")
rank = rds_df %>% arrange(log2FoldChange) %>% select(gene, log2FoldChange) %>% mutate(hum_gene = str_split_fixed(gene,'_',2)[,2]) %>% select(hum_gene, log2FoldChange)
rank = setNames(rank$log2FoldChange,rank$hum_gene)
pathways = gmtPathways("/Users/biankaursul/Workspace/Spring 2020/BST282/Homework_2020/HW2/data/h.all.v7.0.symbols.gmt")
fgseaRes <- fgsea(pathways, rank, minSize=15, maxSize=500, nperm=1000)
## Warning in fgsea(pathways, rank, minSize = 15, maxSize = 500, nperm = 1000): There are ties in the preranked stats (2.33% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.
# A positive NES will indicate that genes in set S will be mostly represented at the top of your list L. a negative NES will indicate that the genes in the set S will be mostly at the bottom of your list L.
head(fgseaRes[order(padj,pval),] %>% filter(NES >= 0 ), n=5)
## pathway pval padj ES NES
## 1 HALLMARK_PROTEIN_SECRETION 0.005376344 0.06105006 0.4929758 1.705547
## 2 HALLMARK_TNFA_SIGNALING_VIA_NFKB 0.010101010 0.06105006 0.3861686 1.475961
## 3 HALLMARK_ALLOGRAFT_REJECTION 0.010204082 0.06105006 0.3658076 1.380867
## 4 HALLMARK_MYC_TARGETS_V1 0.010416667 0.06105006 0.3558244 1.338616
## 5 HALLMARK_KRAS_SIGNALING_UP 0.010416667 0.06105006 0.4508212 1.708234
## nMoreExtreme size
## 1 0 88
## 2 0 177
## 3 0 166
## 4 0 165
## 5 0 172
## leadingEdge
## 1 TSPAN8, ARFIP1, PAM, VAMP4, VAMP7, STX7, CAV2, LAMP2, BET1, RAB9A, SGMS1, SNX2, CLN5, STX12, VAMP3, SNAP23, ANP32E, ATP6V1H, TMX1, RAB14, YIPF6, RAB2A, TMED2, TOM1L1, VPS4B, GALC, STAM, GLA, TSG101, PPT1, ADAM10, RAB5A, M6PR, TPD52, ZW10, CLTA, COPB1, MAPK1, TMED10
## 2 OLR1, KYNU, IL7R, SDC4, CD44, EDN1, DRAM1, IER5, CD69, IL1B, IFIT2, IL18, TLR2, CCRL2, MARCKS, CXCL10, PNRC1, GCH1, ICAM1, MCL1, SGK1, PTGS2, ID2, CD80, NFE2L2, CD83, MXD1, TNFAIP6, TSC22D1, PLEK, TNFAIP8, PANX1, IFIH1, BIRC3, MAP3K8, ETS2, IER3, PLK2, KLF2, BMP2, NFKBIA, RIPK2, GPR183, KLF6, SOD2, KLF10, PTGER4, PDE4B, CEBPB, YRDC, BIRC2, SAT1, PLAUR, DNAJB4, F2RL1
## 3 IL12A, CTSS, IL1B, IL18, PTPRC, TLR2, LYN, LY86, IL18RAP, SRGN, GPR65, IGSF6, ABI1, CD47, UBE2D1, SOCS5, NCK1, ICAM1, FAS, IFNGR1, CD2, KLRD1, ST8SIA4, CD80, GZMA, BCL10, F2R, TGFB2, TLR1, GLMN, HLA-E, CD86, LCP2, CCR2, HLA-DOB, DARS, CD3G, RIPK2, CD3D, IL2RG, B2M, IL7, FGR, TPD52, CD8A
## 4 PTGES3, H2AFZ, VBP1, UBE2E1, YWHAQ, PCNA, COX5A, NAP1L1, PRDX4, CNBP, CCT4, GLO1, SET, VDAC3, PSMA4, SNRPD1, PSMD14, SNRPB2, DEK, EIF1AX, SNRPG, PSMA1, USP1, CLNS1A, SRSF7, PSMC6, SERBP1, TRA2B, HDAC2, PCBP1, IFRD1, DUT, G3BP1, PRPS2, PGK1, EIF2S2, EIF4E, STARD7, EIF4G2, COPS5, RFC4, HNRNPR, SLC25A3, ACP1, BUB3, SRSF1, RRM1, XPO1, MRPL9, CCNA2, PRDX3, RANBP1, SRSF2, ETF1, SSB, ABCE1, TXNL4A, EIF2S1, EEF1B2, RPL22, DHX15, ILF2, PSMA2, TCP1, LSM2, NDUFAB1, DDX18, PWP1
## 5 PRKG2, PPBP, TMEM100, PIGR, ARG1, CTSS, IL7R, GNG11, CXCR4, IL1B, CA2, TSPAN13, RABGAP1L, CXCL10, BTBD3, LAT2, PTPRR, FCER1G, LY96, PTGS2, ZNF277, ID2, SPP1, SPARCL1, GLRX, IL33, ADAM17, BPGM, EVI5, TSPAN1, BIRC3, BTC, TFPI, ZNF639, BMP2, IL2RG, TLR8, DUSP6, PDCD1LG2, ATG10, FBXO4, NIN, YRDC, PLAUR, F2RL1, LAPTM5, HKDC1, ITGA2, IKZF1, TRIB2, TNFAIP3, RELN
head(fgseaRes[order(padj,pval),] %>% filter(NES <= 0), n=5)
## pathway pval padj ES NES
## 1 HALLMARK_MYOGENESIS 0.001098901 0.02768549 -0.5458420 -1.680296
## 2 HALLMARK_ADIPOGENESIS 0.001107420 0.02768549 -0.5636344 -1.729789
## 3 HALLMARK_MYC_TARGETS_V2 0.002610966 0.04351610 -0.6101743 -1.617891
## 4 HALLMARK_ESTROGEN_RESPONSE_EARLY 0.040929204 0.17053835 -0.4287270 -1.314685
## 5 HALLMARK_MITOTIC_SPINDLE 0.053610503 0.18687708 -0.4196472 -1.293872
## nMoreExtreme size
## 1 0 185
## 2 0 179
## 3 1 50
## 4 36 174
## 5 48 188
## leadingEdge
## 1 PC, SMTN, ITGA7, GAA, AEBP1, CRAT, PLXNB2, TGFB1, IGF1, COL6A2, COL15A1, ADAM12, HDAC5, TSC2, STC2, SH2B1, PICK1, ITGB4, FLII, DMPK, MAPRE3, LPIN1, ACSL1, COL4A2, TPD52L1, APOD, FXYD1, MRAS, ENO3, ADCY9, CFD, BHLHE40, COL1A1, SIRT2, GSN, MYOM1, MYH11, PDLIM7, FGF2, MEF2D, OCEL1, NAV2, COL6A3, LSP1, SPEG, NCAM1, MYO1C, SPTAN1, CACNA1H, CAMK2B, KIFC3, ABLIM1, ITGB5, COL3A1
## 2 SNCG, LEP, ITGA7, APOE, PFKL, ADCY6, CRAT, SLC27A1, SCARB1, CYC1, PHLDB1, DGAT1, COL15A1, ETFB, MGLL, ELMOD3, FAH, SLC1A5, TKT, GPX4, ENPP2, FABP4, ECH1, PFKFB3, ITIH5, FZD4, GPAM, ESYT1, LPL, ITSN1, PEX14, UCP2, STAT5A, PREB, C3, ABCB8, SLC19A1, VEGFB, ACO2, BAZ2A, AGPAT3, IDH3G, UQCRC1, PPARG, PGM1, POR, EPHX2, LAMA4, ANGPTL4, ACADS, RETSAT, ELOVL6, NMT1
## 3 SLC29A2, FARSA, TCOF1, RRP9, MYBBP1A, TBRG4, NOP2, SRM, GRWD1, SLC19A1, IPO4, WDR74, MCM5, LAS1L, MAP3K6, PPRC1, PES1, HK2, PLK1, RCL1, RRP12, PUS1, IMP4, BYSL, SUPV3L1, NOP56, UTP20, AIMP2
## 4 PDZK1, GREB1, SCARB1, GFRA1, NCOR2, EGR3, STC2, ITPK1, MYBBP1A, FLNB, FKBP5, TPD52L1, ADCY9, AR, KDM4B, SLC24A3, CA12, BHLHE40, MED24, RAPGEFL1, WFS1, CELSR2, FASN, UNC119, PEX11A, NAV2, AFF1, DHRS3, TUBB2B, ABHD2, IGF1R, NBL1, RAB17, FAM102A, SLC27A2, CISH, ABLIM1, BLVRB, PPIF, HR, MPPED2, NADSYN1, SLC9A3R1, RRP12, CYP26B1, IGFBP4, INHBB, ABAT, ANXA9, FOXC1, CCND1, ELOVL2, CANT1, RHOBTB3, RPS6KA2, TMEM164, NRIP1, DHCR7, SYT12, PTGES, CXCL12, KCNK5, SLC37A1, TGM2, AKAP1, RHOD, RET
## 5 HDAC6, PLEKHG2, MAP3K11, FLNA, TUBGCP6, CEP250, SHROOM1, NUMA1, FLNB, MARK4, TUBGCP2, ITSN1, DYNC1H1, TBCD, SEPT9, ABR, CNTROB, ABL1, ARHGEF11, BCR, LLGL1, INCENP, CSNK1D, TAOK2, KPTN, ARHGDIA, ARFIP2, MYO9B, TSC1, CTTN, CYTH2, TUBA4A, ARHGEF7, GSN, PXN, CLASP1, ATG4B, CDK5RAP2, BCAR1, FSCN1, ARHGAP27, SPTAN1, SOS1, KIF22, CDC42EP4, CDC42BPA, ALMS1, PLK1, MYO1E, SMC1A, KNTC1, CLIP2, EPB41, MID1, TRIO, BIN1, SUN2, KIF2C, ACTN4, CENPF, ARHGAP10, KIF3B, KATNB1, ARHGAP4, PCNT
topUp <- fgseaRes[order(padj,pval),] %>%
filter(NES >= 0) %>%
top_n(10, wt=-padj)
topDown <- fgseaRes[order(padj,pval),] %>%
filter(NES < 0) %>%
top_n(10, wt=-padj)
topPathways <- bind_rows(topUp, topDown) %>%
arrange(-NES)
ggplot(topPathways, aes(x=reorder(pathway,NES), y=NES)) +
geom_bar(stat='identity', width=.5) +
scale_fill_manual(name="NES") +
labs(subtitle="Rank of Hallmark Pathways NES",
title= "Diverging Bars") +
coord_flip()
The top 20 significant Hallmark Pathways with most positive NES and most negative NES are ploted in the diverging bars. The ten significant pathways with positive normalized enrichment scores are more involved in lung related biological processes, and the ten pathways with negative NES are more involved in adipose related processes. For example, genes in the hallmark adipogenesis pathways (eg. FABP4, ADIPOQ, PPARG) are more expressed in adipose tissue according to GTEx and adipogenesis- the formation of adipocytes is clearly relevant to adipose tissues. Other pathways (NES < 0) including cholesterol homeostasis, estrogen response, bile acid metabolism and xenobiotic metabolism are specific to adipose tissue and close to the adipose relevant pathways annotated by DAVID. On the other end, the pathways on the top of the chart, including inflamatory response, KRAS signaling, IL6_JAK_STAT3 signaling, are more specific to lung tissue. For example, IL6/JAK/STAT3 signaling pathway suppresses lung cancer initiation through maintaining lung homeostasis and mutations in KRAS signaling pathway is correlated with lung cancers. Thus, upregulated genes in lung are more likely to be found in these pathways. In conclusion, the gene sets obtained from GSEA captured the differential expression of genes in the two tissues pretty well. Compared to DAVID analysis, GSEA was able to identify more significant results in both up-regulated and down-regulated directions and since it’s using the whole dataset, it is possibly better able to detect a set of genes experiencing subtle differential expressions than DAVID analysis.
RSeQC on RNA-seq generates many output files. One such file is called geneBodyCoverage.r which contains normalized reads mapped to each % of gene / transcript body. Suppose that we want to visualize all 12 samples from a recent RNA-seq library together to quickly perform quality control. These data files are present in the part5 folder. Write a python program to extract the values and name from each file. The same script should then draw the gene body coverage for all the samples (3 rows x 4 cols) in one figure. We provide an example with 3 x 2 samples in one figure. Include your code and final figure in your report.
Student response
import matplotlib.pyplot as plt
import string
Char = []
for i in range(ord('A'), ord('N')):
Char.append(chr(i))
Char.remove('I')
files = [f'../data/part5/{char}.geneBodyCoverage.r' for char in Char]
coverage = {}
for char,file in zip(Char,files):
f = open(file,"r")
raw_read = f.read()
coverage[char] = [float(i) for i in raw_read.split()[2].rstrip(')').lstrip('c(').split(',')]
fig, ax = plt.subplots(3,4, figsize = (20,15))
for (i,item) in enumerate(coverage):
ax.ravel()[i].plot(coverage[item])
ax.ravel()[i].set_title(item)
ax.ravel()[i].set_xlabel("Gene Body Percentile (5'-3')")
ax.ravel()[i].set_ylabel("Coverage")
plt.tight_layout()
plt.savefig("Part5 Output")
In a recent manuscript (published September 2019), Zhou et al. describe a modified version of RNA-seq called SILVER-seq that enables profiling of extracellular RNAs (exRNAs). The manuscript reports impressive performance in classifying patients with breast cancer compared to healthy controls as well as whether the cancer was recurrent. About three weeks ago, the original findings were challenged by Hartl and Gao where they argued that a batch effect confounded the interpretation of the work. The authors then rebutted the challenge.
In the main manuscript (see 1_original.pdf), what bioinformatics methods were used in conjuction with the SILVER-seq protocol to predict patient status? Name the methods and describe their purpose (list at least 3).
Student response 1. Standard RNA-seq: The purpose of carrying out standard RNA-seq is to obtain the exRNA expression profiles and compare it with SILVER-seq obtained exRNA expression profiles and then assess the variability of SILVER-seq measurements.
Principal Component Analysis: PCA is used to determine whether different types of RNAs exhibit the same power of differentiating cancer and normal samples. They first did a baseline PCA and found possible separation of cancer and noncancer samples by some subspaces. They then proceeded to test whether the degrees of cancer-normal separation are similar across different types of RNAs by another PCA analysis.
Differential Expression: The purpose is to test for differential expression of exRNAs between the serum samples collected from cancer and normal donors. The fold change and false discovery rate (FDR) for every exRNA were computed for that purpose. there were more exRNAs with higher expression in cancer (cancer-upregulated) than those with lower expression in cancer (cancer-downregulated) as compared to normal samples.
The next questions are for graduate students only. In 3-5 sentences each, answer the following questions related to the short letters that comment on the manuscript.
Briefly summarize the Hartl and Gao response (see 2_Hartl_response.pdf). Specifically, what evidence do Hartl and Gao offer that the interpretation of the original manuscript may be confounded by a batch effect? If you were the bioinformatician analyzing the original data, what steps, if any, could you take to eliminate the batch effect?
Student response The Hartl and Gao response suggest that potential batch effect weakens the stated results of the original paper. The reanalysis of the raw data demonstrated a perfect confound between read length and cancer status, and the PCA components separating cancer from normal tissues is correlated with alignment metrics. Therefore, they suspect that serum from individuals with cancer was processed separately from serum from individuals without cancer and they futher suggested that the separation observed by PCA analysis was likely due to batch effect rather than cancer status. If I were the bioinformatician analyzing the original data, I would apply methods like ComBat to correct for potential batch effect.
Summarize the response to Hartl (see 3_response_to_Hartl.pdf). Specifically, how do the original authors argue that there is “a lack of between batch differences”? Do you find their rebuttal convincing?
Student response They responded that read-length difference does not cause a significant batch effect because 1)the absolute difference of mismatch rates between the 50bp and 75bp reads of cancer and normal samples is within acceptable range and under the typically used threshold of the mapping software. 2) the distributions of the measured exRNA levels was similar between a cancer sample and a normal sample, indicating that every sample contains a similar proportion of highly expressed exRNAs. 3) the pairwise differences of the normal samples in exRNA levels nearly completely overlapped with the between-batch pairwise differences. I think their rebuttal convincing because in their original paper, they used different methods including standard rna-seq for comparison and control. Also their rebuttal is based on literature and factual comparisons of exRNA levels within and between the 2 batches.
Design a modified version of the study that utilizes both proper experimental setup and computational tools discussed in this lab/homework that would ameloirate the potential batch effect in assessing the efficacy of SILVER-seq. Comment specifically on batch design and analytical methods. Assume that you want to test the same number of samples (~130) but no more than 10 samples can be processed per batch.
Student response I would modify the original experiment by Zhou etc. by randomizing serum samples of normal and cancer tissues into multiple batch and for each sample record the batch number correspondingly. Before conducting PCA analysis, I would add one step adjusting for batch effect (e.g. ComBat) with the pre-recorded batch number column. After adjusting for batch effect, we can follow the remaining steps of the original study design and find out if there exists distinguishable separation in principal components between cancer and normal samples.
After reading the primary manuscript and both responses, how do you interpret the efficacy of SILVER-seq as a tool for cancer diagnostics/prediction? What remaining questions do you think need to be answered before this technology could be confidently used with patients?
Optional response (feedback will be given but students will not receive points.)
Note: you can access the raw SILVER-seq data from this study here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE131512